Systematic literature review

Vincent Bagilet https://www.sipa.columbia.edu/experience-sipa/sipa-profiles/vincent-bagilet (Columbia University)https://www.columbia.edu/ , Léo Zabrocki https://www.parisschoolofeconomics.eu/en/ (Paris School of Economics)https://www.parisschoolofeconomics.eu/en/
2020-11-25

Purpose of the document

In the present document, we aim to conduct a systematic review of the literature on health effects of air pollution. We carry out an extremely simple text mining analysis of this literature in order to have a sense of the proportion of articles studying health effects of air pollution data taking the issue of missing data into account.

Retreiving article texts

To carry out this analysis, we use the fulltext package to get the full text of each article whose title contains the words “air pollution” and “health”. This set of articles might be too restrictive and we may want to broaden our scope later.

To begin with, we focus on articles published on Scopus and Pubmed. To access Scopus API, one needs to register to get an API key (stored in the .Renviron) for Elsevier and a Crossref TDM API key. Note that downloading of full texts may not work if one is not connected directly to their institution internet network. Pubmed articles are accessed via Entrez. An API key enables to increase the number of requests per seconds from 3 to 10. More information on authentification is available on the fulltext manual.

Set of articles to consider

First of all, we need to clearly define the set of articles we want to consider in this analysis. Our baseline search query is: "air pollution" AND ("emergency" OR "mortality") AND ("particulate matter" OR ozone OR "nitrogen dioxyde" OR "sulfur dioxyde"). This returns about 57000 articles (without accounting for duplicates in between journals). This number being rather large, we want to narrow it down. To focus on the epidemiological literature, we add the query term poisson. To get causal analyses, we add the term causal.

#set folder to store downloads
cache_options_set(full_path = "/Users/vincentbagilet/Documents/fulltext")
# cache_options_set(full_path = "/Volumes/VincentSSD/Research_data/air_pollution/lit_review/full_texts")

query <- 'TITLE(("air pollution"  OR "air quality" OR "particulate matter" OR ozone OR "nitrogen dioxide" OR "sulfur dioxide") AND ("emergency" OR "mortality") AND NOT ("long term" or "long-term")) AND ("particulate matter" OR ozone OR "nitrogen dioxide" OR "sulfur dioxide")'# AND (poisson OR causal)'

#Run a search
search <- ft_search(query, from = c("scopus", "entrez"), limit = 50)
# search <- ft_search(query, from = "scopus", limit = 2000)
# search_entrez <- ft_search(query, from = "entrez", limit = 300)


metadata_scopus <- search$scopus$data %>% 
  as_tibble() %>% 
  rename_all(function(x) str_remove_all(names(.), "prism:|dc:")) %>% 
  rename_all(function(x) str_replace_all(names(.), "-", "_"))

metadata_entrez <- search$entrez$data %>% #search_entrez$entrez$data 
  as_tibble() %>% 
  rename(id = uid)

metadata_scopus_short <- metadata_scopus %>% 
  select(doi, title, creator, publicationName, pubmed_id, coverDate) %>% 
  rename(
    authors = creator,
    journal = publicationName
  ) %>% 
  mutate(
    pubmed_id = ifelse(!str_detect(pubmed_id, "[0-9]{7}"), NA, pubmed_id),
    pub_date = ymd(coverDate)
  ) %>% 
  select(-coverDate)

metadata_entrez_short <- metadata_entrez %>% 
  select(doi, title, authors, fulljournalname, pmid, pubdate) %>% 
  rename(
    journal = fulljournalname,
    pubmed_id = pmid
  ) %>% 
  mutate(
    pubmed_id = ifelse(!str_detect(pubmed_id, "[0-9]{7}"), NA, pubmed_id),
    pub_date = ymd(pubdate)
  ) %>% 
  select(-pubdate)

metadata_lit_review <- metadata_scopus_short %>% 
  rbind(metadata_entrez_short) %>% 
  filter(!is.na(doi))

saveRDS(metadata_lit_review, "../Outputs/metadata_lit_review.RDS")

The metadata from different sources having different shapes, we only select a few relevant columns to build an overall metadata set.

The full list of articles studied here is as follows:

metadata_lit_review <- readRDS("../Outputs/metadata_lit_review.RDS")

metadata_lit_review %>% 
  select(authors, pub_date, title, journal) %>% 
  datatable()

Retreiving abstracts

First, we focus on the abstracts and download them.

There is no fulltext function to access abstracts from Entrez. Therefore, using the DOI, we get the abstracts from Semantic Scholar. We also access Scopus abstracts from Semantic Scholar since, due to my IP address being located outside of Columbia, I cannot access the texts and abstracts from Scopus.

In Semantic Scholar, there is a rate limit of 100 articles per 5 min or 20 articles per minute. We therefore need to pause the system to be able to download everything.

# opts_scopus <- list(key = Sys.getenv('ELSEVIER_SCOPUS_KEY'))

# abstracts_scopus <- metadata_scopus$doi %>% 
#   ft_abstract(from = "scopus", scopusopts = opts_scopus) 

get_abstracts <- function(doi) {
  vect_doi <- unique(doi)
  number_periods <- (length(vect_doi) - 1) %/% 20
  # abs <- NULL
  abs_ls <- NULL

  message(str_c("Total downloading time: ", number_periods, "min"))
  
  for (i in 0:number_periods) {
    
    doi_period <- vect_doi[(20*i+1):(20*(i+1))]
    doi_period <- doi_period[!is.na(doi_period)]
    
    abs_ls[[i + 1]] <- doi_period %>% 
      ft_abstract(from = "semanticscholar") %>%
      .$semanticscholar 
    
    # abs_period <- doi_period %>% 
    #   ft_abstract(from = "semanticscholar") %>% 
    #   .$semanticscholar %>% 
    #   as_tibble() %>% 
    #   unnest(cols = everything()) %>% 
    #   pivot_longer(everything(), names_to = "doi", values_to = "abstract") %>% 
    #   filter(doi != abstract)
    # 
    # abs <- abs %>% 
    #   rbind(abs_period)
    
    if (i < number_periods & number_periods != 0) {
      message(str_c("Remaining time: ", number_periods - i, "min"))
      Sys.sleep(63)
    }
  } 
  
  return(abs_ls)
}

abstracts_f <- metadata_lit_review$doi %>% #[c(1:50, 1700:1750)] 
  get_abstracts() 

abstracts_loop <- NULL
for (i in seq_along(abstracts_f)) {
  new_abstracts <- abstracts_f[[i]] %>%
    as_tibble() %>% 
    unnest(cols = everything()) %>%
    pivot_longer(everything(), names_to = "doi", values_to = "abstract") %>%
    filter(doi != abstract)
  
  abstracts_loop <- abstracts_loop %>% 
    rbind(new_abstracts)
}
abstracts <- abstracts_loop %>% 
    left_join(metadata_lit_review, by = "doi") %>% 
    distinct(title, .keep_all = TRUE)

saveRDS(abstracts, "../Outputs/abstracts.RDS")

Full texts

We then download the full texts (using the ft_get functions). The full texts are stored in an xml format in the cache directory. Note that due to my IP address being located outside of Columbia, I cannot access the texts from Scopus.

metadata_scopus$doi %>% 
  ft_get(progress = TRUE)

metadata_entrez$doi %>% 
  ft_get(from = "entrez", progress = TRUE, type = "pdf")

Once we have downloaded all the files, we can put them into a table format, before analyzing them.

#put text into a table format
article_texts <- ft_table() %>% #searches for documents in the cache directory 
  select(-paths, -ids_norm) %>% 
  .$text 

saveRDS(article_texts, "../Outputs/article_texts.RDS")

Analysis

Abstracts analysis

First, we briefly explore the main themes of the articles.

abstracts <- readRDS("../Outputs/abstracts.RDS")

abstracts %>%
  unnest_tokens(word, abstract, to_lower = TRUE) %>% 
  anti_join(tidytext::stop_words) %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100, random.color = TRUE))

Retreiving effects and confidence interavals

Then we want to extract the effects, odds ratio and associated confidence intervals in the abstracts. To identify the effect and CI we proceed as follows:

effects_CI <- abstracts %>% 
  mutate(
    abstract = str_replace_all(abstract, "·", ".")
  ) %>% 
  select(doi, abstract) %>% 
  unnest_tokens(sentence, abstract, token = "sentences", to_lower = FALSE, drop = FALSE) %>% 
  # mutate(
  #   contains_CI = str_detect(sentence, "(95%|\\bCI\\b|\\b(c|C)onfidence (i|I)nterval\\b)")
  # ) %>%
  # filter(contains_CI) %>%
  mutate(
    effect = str_extract_all(sentence, "[−\\d.-]+(?=[^\\d.]{0,7}(95%|\\bCI|\\b(c|C)onfidence (i|I)nterval\\b))(?<!\\b95)"),
    CI = str_extract_all(sentence, "(?<=(95%|\\bCI\\b|\\b(c|C)onfidence (i|I)nterval\\b)[^\\d.]{0,4})[−\\d.-]+[^\\d.]{0,4}[−\\d.-]+")
  ) 

#In effect_OR, could replace [^\\d.] by [\\s:;,\\(\\[%] to be more precise

#previous version
# %>%
#   mutate(
#     effect = str_extract_all(sentence, "(?<=\\b(with a |with ))[\\d.]+"),
#     bound_CI = str_extract_all(sentence, "(?<=\\bCI.{0,4})[−\\d.-]+[\\s,-–]{0,3}[−\\d.-]+"),
#     OR = str_extract_all(sentence, "(?<=\\bOR.{0,4})[\\d.]+")
#   )

Note that, we could add something to detect CI and effects which are as “18.9% (-8.7, 54.7)”, ie when “CI” is not repeated.

Once the effects and CI are identified, some wrangling is necessary in order to get the data into a usable format.

effects_CI_clean <- effects_CI %>% 
  unnest(effect, CI, keep_empty = TRUE) %>% 
  separate(CI, into = c("low_CI", "up_CI"), "([\\s,]+)|(?<!^)[-–]") %>% 
  mutate(across(c("effect", "low_CI", "up_CI"), .fns = as.numeric)) %>% 
  mutate(
    low_CI = ifelse(is.na(up_CI), NA, low_CI),
    up_CI = ifelse(is.na(low_CI), NA, up_CI),
    effect = ifelse(is.na(low_CI), NA, effect),
    pb_or_not_significant = (effect <= low_CI | effect >= up_CI)
  ) %>% 
  select(-sentence, -abstract) %>% 
  filter(!is.na(effect))

articles_effect <- abstracts %>% 
  left_join(effects_CI_clean, by = "doi")

articles_effect %>% 
  group_by(doi) %>% 
  summarise(has_effect = mean(effect, na.rm = TRUE)) %>% 
  count(ret = !is.nan(has_effect)) %>% 
  mutate(ret = ifelse(ret, "Yes", "No")) %>% 
  kable(col.names = c("Effect retreived", "Number of articles"))
Effect retreived Number of articles
No 66
Yes 27

Retreiving study period

It might also be interesting to information about the study period. To do so, we also use regex.

month_regex <- "\\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)"
date_regex <- str_c("(",month_regex, "\\s){0,1}(19|20)\\d{2}")

abstracts_date <- abstracts %>% 
  mutate(
    dates_obs = str_extract_all(abstract, str_c(date_regex, "( to |\\s?—\\s?|\\s?-\\s?| and )", date_regex))
  ) %>% 
  separate(dates_obs, into = c("begin_obs", "end_obs"), "( to |\\s?—\\s?|\\s?-\\s?| and )") 

Power analysis

In this section, we implement robustness tests in order to compute the power, type M and type S error in the studied articles. To do so, we use the package retrodesign and suppose that the true effect is a fraction of the measured effect. To use the retro_design() function, we first need to compute the standard error of the estimate. Probably due to rounding effect, we often do not get the same value for the standard error if we compute it using the upper or the lower bound of the CI. Thus, we average across the two values obtained.

articles_retro <- articles_effect %>% 
  mutate(
    se_up = (up_CI - effect)/1.96,
    se_low = (low_CI - effect)/(-1.96),
    se = (se_up + se_low)/2
  ) %>% 
  select(-se_low, -se_up) %>% 
  filter(!is.na(effect)) %>% 
  mutate(
    # retro_0.01 = as.tibble(retro_design(effect*0.01, se)),
    # retro_0.05 = as.tibble(retro_design(effect*0.05, se)),
    # retro_0.1 = as.tibble(retro_design(effect*0.1, se)),
    retro_0.5 = as.tibble(retro_design(effect*0.5, se)),
    retro_0.67 = as_tibble(retro_design(effect*0.67, se)),
    retro_0.75 = as_tibble(retro_design(effect*0.75, se)),
    # retro_0.9 = as_tibble(retro_design(effect*0.9, se)) 
  ) %>% 
  pivot_longer(
    starts_with("retro"), 
    names_to = "prop_true_effect", 
    values_to = "computed"
  ) %>% 
  mutate(
    power = computed$power,
    typeS = computed$typeS,
    typeM = computed$typeM,
    prop_true_effect = as.numeric(str_sub(prop_true_effect, 7, nchar(prop_true_effect)))
  ) %>% 
  select(-computed)

Then, we quickly explore the results. First, we look at the distribution of power, type M and type S error across simulation and for different size of true effect

articles_retro %>% 
  ggplot(aes(x = power)) + 
  geom_histogram(bins = 10) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Distribution of power in retrodesign simulations", subtitle = "If the size of true effect was a fraction of the measured effect")
  # geom_density()

articles_retro %>% 
  ggplot(aes(x = typeM)) + 
  geom_histogram(bins = 10) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Distribution of type M error in retrodesign simulations", subtitle = "If the size of true effect was a fraction of the measured effect")
articles_retro %>% 
  ggplot(aes(x = typeS)) + 
  geom_histogram(bins = 10) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Distribution of type S error in retrodesign simulations", subtitle = "If the size of true effect was a fraction of the measured effect")

Then, we look at the relation between power, type M and type S error and true effect size.

articles_retro %>% 
  ggplot() +
  geom_point(aes(x = power, y = typeM)) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Link between type M and power in retrodesign simulations", subtitle = "If the size of true effect was a fraction of the measured effect", x = "Power", y = "Type M")
articles_retro %>% 
  ggplot() +
  geom_point(aes(x = power, y = typeS)) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Link between type S and power in retrodesign simulations", subtitle = "If the size of true effect was a fraction of the measured effect", x = "Power", y = "Type S")
articles_retro %>% 
  filter(typeM < Inf) %>% 
  group_by(prop_true_effect) %>% 
  summarise(typeM = mean(typeM, na.rm = TRUE)) %>% 
  ggplot() +
  geom_point(aes(x = prop_true_effect, y = typeM)) +
  labs(title = "Link between type M and 'true effect' in retrodesign simulations", x = "True effect as a proportion of the measured effect", y = "Type M")
articles_retro %>% 
  group_by(prop_true_effect) %>% 
  summarise(typeS = mean(typeS, na.rm = TRUE)) %>% 
  ggplot() +
  geom_point(aes(x = prop_true_effect, y = typeS)) +
  labs(title = "Link between type S and 'true effect' in retrodesign simulations", x = "True effect as a proportion of the measured effect", y = "Type S")

We can also look at how power, type M and type S error evolved with publication date.

articles_retro %>% 
  group_by(year = year(pub_date), prop_true_effect) %>%
  mutate(power = mean(power, na.rm = TRUE)) %>% 
  ungroup() %>% 
  ggplot() +
  geom_point(aes(x = year, y = power)) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Evolution of power with publication date", x = "Publication date", y = "Power")
articles_retro %>% 
  group_by(year = year(pub_date), prop_true_effect) %>%
  mutate(typeM = mean(typeM, na.rm = TRUE)) %>% 
  ungroup() %>% 
  ggplot() +
  geom_point(aes(x = year, y = typeM)) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Evolution of type M with publication date", x = "Publication date", y = "Type M")
articles_retro %>% 
  group_by(year = year(pub_date), prop_true_effect) %>%
  mutate(typeS = mean(typeS, na.rm = TRUE)) %>% 
  ungroup() %>% 
  ggplot() +
  geom_point(aes(x = year, y = typeS)) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Evolution of type S with publication date", x = "Publication date", y = "Type S")

Full texts analysis

To analyse the texts, we first start by simply exploring the proportion of articles mentioning the words “missing” and “power”.

readRDS("../Outputs/article_texts.RDS")

article_texts %>%
  summarise(
    contains_missing = sum(str_detect(text, pattern = "\\bmissing"))/n(), #could put a vector with various terms
    contains_power = sum(str_detect(text, pattern = "\\bpower\\b"))/n() #statistical power
  )